1 Introduction

The importance of modeling speech articulation for high-quality
audiovisual (AV) speech synthesis is widely acknowledged.
Nevertheless, while state-of-the-art, data-driven approaches to
facial animation can make use of sophisticated motion capture
techniques, the animation of the intraoral articulators (viz. the
tongue, jaw, and velum) typically makes use of simple rules or
viseme morphing, in stark contrast to the otherwise high quality of
facial modeling. Using appropriate speech production data could
significantly improve the quality of articulatory animation for AV
synthesis.

2 Articulatory animation 

To complement a purely data-driven AV synthesizer employing
bimodal unit-selection [Musti et al. 2011], we have implemented a
framework for articulatory animation [Steiner and Ouni 2012] using
motion capture of the hidden articulators obtained through
electromagnetic articulography (EMA) [Hoole and Zierdt 2010]. One
component of this framework compiles an animated 3D model of
the tongue and teeth as an asset usable by downstream components
or an external 3D graphics engine. This is achieved by rigging static
meshes with a pseudo-skeletal armature, which is in turn driven by
the EMA data through inverse kinematics (IK). Subjectively, we
find the resulting animation to be both plausible and convincing.
However, this has not yet been formally evaluated, and so the
motivation for the present paper is to conduct an objective analysis.
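
To make the IK step concrete, the following is a minimal sketch of how
a chain of armature joints could be pulled toward a single EMA coil
position. The source does not state which IK solver the framework
uses; this sketch substitutes FABRIK (forward and backward reaching
IK), a common iterative scheme, purely for illustration. The names
`fabrik` and `_move_towards`, the tolerance, and the iteration limit
are assumptions, not part of the framework.

```python
import math


def _move_towards(origin, point, length):
    """Return the point at distance `length` from `origin` toward `point`."""
    d = math.dist(origin, point)
    if d == 0.0:
        # Degenerate case: no direction is defined, so offset along x.
        return (origin[0] + length,) + tuple(origin[1:])
    t = length / d
    return tuple(o + (p - o) * t for o, p in zip(origin, point))


def fabrik(joints, target, tol=1e-4, max_iter=50):
    """Move the last joint of a chain toward `target`, keeping bone lengths.

    `joints` is a sequence of 3D points from root to tip; the root stays
    fixed. This is an illustrative solver, not the framework's own.
    """
    joints = [tuple(j) for j in joints]
    target = tuple(target)
    lengths = [math.dist(a, b) for a, b in zip(joints, joints[1:])]
    root = joints[0]

    # Unreachable target: stretch the chain in a straight line toward it.
    if math.dist(root, target) >= sum(lengths):
        for i, length in enumerate(lengths):
            joints[i + 1] = _move_towards(joints[i], target, length)
        return joints

    for _ in range(max_iter):
        if math.dist(joints[-1], target) <= tol:
            break
        # Backward pass: pin the tip to the target and work toward the root.
        joints[-1] = target
        for i in range(len(joints) - 2, -1, -1):
            joints[i] = _move_towards(joints[i + 1], joints[i], lengths[i])
        # Forward pass: pin the root again and work toward the tip.
        joints[0] = root
        for i, length in enumerate(lengths):
            joints[i + 1] = _move_towards(joints[i], joints[i + 1], length)
    return joints
```

In this sketch, each call would correspond to one animation frame, with
`target` taken from the EMA coil position for that frame. A production
rig would typically also constrain joint orientations, which this
sketch ignores.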

3 Multimodal speech production data 

The mngu0 articulatory corpus contains a large set of 3D EMA
data [Richmond et al. 2011] from a male speaker of British English,
as well as volumetric magnetic resonance imaging (MRI) scans
of that speaker's vocal tract during sustained speech production
[Steiner et al. 2012]. Using the articulatory animation framework,
static meshes of dental cast scans and the tongue (extracted from the
MRI subset of the mngu0 corpus) can be animated using motion
capture data from the EMA subset, providing a means to evaluate
the synthesized animation on the generated model (Figure 1).

4 Evaluation 

In order to analyze the degree to which the animated articulators
match the shape and movements captured by the natural speech
production data, several approaches are described.

Figure 1: Animated articulatory model in bind pose, with and
without maxilla; EMA coils rendered as spheres.

• The positions and orientations of the IK targets are dumped to
data files in a format compatible with that of the 3D
articulograph. This allows visualization and direct comparison of
the animation with the original EMA data, using external analysis
software.

• The distances of the EMA-controlled IK targets to the
surfaces of the animated articulators should ideally remain close
to zero during deformation. Likewise, there should be
collision with a reconstructed palate surface, but no penetration.

• A tongue mesh extracted from a volumetric MRI scan in the
mngu0 data, when deformed to a pose corresponding to a
given phoneme, should assume a shape closely resembling
the vocal tract configuration in the corresponding volumetric
scan.
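
The first two checks above can be sketched as simple geometric
measures. The code below is a minimal illustration under stated
assumptions, not the framework's implementation: `rmse` compares an
exported IK target trajectory with the original EMA trajectory frame
by frame, and `signed_distance_to_mesh` measures how far a coil lies
from a triangle mesh. The function names, the mesh representation
(vertex list plus index triples), and the convention that palate
triangles are wound so that their normals point into the oral cavity
are all hypothetical. The sign is taken from the nearest triangle
only, which is an approximation near edges and vertices.

```python
import math


def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def along(a, d, t):
    """Point a + t * d."""
    return (a[0] + t * d[0], a[1] + t * d[1], a[2] + t * d[2])


def closest_point_on_triangle(p, a, b, c):
    """Closest point to p on triangle abc (Voronoi-region test)."""
    ab, ac, ap = sub(b, a), sub(c, a), sub(p, a)
    d1, d2 = dot(ab, ap), dot(ac, ap)
    if d1 <= 0.0 and d2 <= 0.0:
        return a                                  # vertex region a
    bp = sub(p, b)
    d3, d4 = dot(ab, bp), dot(ac, bp)
    if d3 >= 0.0 and d4 <= d3:
        return b                                  # vertex region b
    vc = d1 * d4 - d3 * d2
    if vc <= 0.0 and d1 >= 0.0 and d3 <= 0.0:
        return along(a, ab, d1 / (d1 - d3))       # edge region ab
    cp = sub(p, c)
    d5, d6 = dot(ab, cp), dot(ac, cp)
    if d6 >= 0.0 and d5 <= d6:
        return c                                  # vertex region c
    vb = d5 * d2 - d1 * d6
    if vb <= 0.0 and d2 >= 0.0 and d6 <= 0.0:
        return along(a, ac, d2 / (d2 - d6))       # edge region ac
    va = d3 * d6 - d5 * d4
    if va <= 0.0 and (d4 - d3) >= 0.0 and (d5 - d6) >= 0.0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return along(b, sub(c, b), w)             # edge region bc
    denom = 1.0 / (va + vb + vc)
    return along(along(a, ab, vb * denom), ac, vc * denom)  # face interior


def signed_distance_to_mesh(p, vertices, faces):
    """Distance from p to the mesh; negative behind the nearest triangle."""
    best = None
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        q = closest_point_on_triangle(p, a, b, c)
        d = math.dist(p, q)
        if best is None or d < best[0]:
            side = dot(sub(p, q), cross(sub(b, a), sub(c, a)))
            best = (d, -1.0 if side < 0.0 else 1.0)
    if best is None:
        raise ValueError("mesh has no faces")
    return best[0] * best[1]


def rmse(track_a, track_b):
    """Root-mean-square Euclidean error between two equal-length tracks."""
    if len(track_a) != len(track_b) or not track_a:
        raise ValueError("tracks must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(track_a, track_b))
    return math.sqrt(total / len(track_a))
```

Under these assumptions, the absolute value of the signed distance
from a coil to the tongue mesh would be expected to stay near zero
across frames, while a negative value against the palate mesh would
flag a frame in which a coil has passed through the palate.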

These evaluation approaches are implemented as unit and
integration tests in the corresponding phases of the model compiler's
build lifecycle, automatically producing appropriate reports by which
the naturalness of the articulatory animation may be assessed.
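
As an illustration of how such a check might be phrased as an
automated test, the sketch below compares a deformed tongue mesh with
a reference mesh from the corresponding MRI scan and fails when the
mean surface error exceeds a tolerance. Python's `unittest` module
stands in for whatever test harness the model compiler actually uses.
The error measure (symmetric mean nearest-vertex distance), the
tolerance value, the class and function names, and the toy vertex
data are all assumptions made for this example.

```python
import math
import unittest

# Hypothetical tolerance, in the same length unit as the mesh data.
MAX_MEAN_SURFACE_ERROR = 2.0


def mean_nearest_vertex_distance(source, reference):
    """Mean distance from each source vertex to its nearest reference vertex."""
    return sum(min(math.dist(s, r) for r in reference)
               for s in source) / len(source)


def symmetric_surface_error(mesh_a, mesh_b):
    """Average of the two directed mean nearest-vertex distances."""
    return 0.5 * (mean_nearest_vertex_distance(mesh_a, mesh_b)
                  + mean_nearest_vertex_distance(mesh_b, mesh_a))


class TonguePoseShapeTest(unittest.TestCase):
    def setUp(self):
        # Toy stand-ins for the deformed mesh and the MRI-derived mesh.
        self.deformed = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 5.0)]
        self.scan = [(0.0, 0.0, 0.5), (10.0, 0.0, 0.5), (20.0, 0.0, 5.5)]

    def test_deformed_mesh_matches_scan(self):
        error = symmetric_surface_error(self.deformed, self.scan)
        self.assertLess(error, MAX_MEAN_SURFACE_ERROR)


def run_report():
    """Run the test case and return the result object for reporting."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        TonguePoseShapeTest)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

A vertex-based measure is used here only because it is short; with
real meshes of differing resolution, a point-to-surface distance
would likely be a more faithful measure of shape agreement.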